[None][fix] write per-rank torch profile traces #13536
GavinZhu-GMI wants to merge 1 commit into NVIDIA:main from
Conversation
No actionable comments were generated in the recent review. 🎉
Walkthrough
The change adds rank-specific filename handling to torch profiler trace exports. When tracing is enabled via environment variable, the export filename is rewritten to include the global rank identifier, preventing concurrent writes from multiple ranks to a single file.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
Pre-merge checks: 4 passed, 1 warning.
/bot run

PR_Github #45826 [ run ] triggered by Bot. Commit:

PR_Github #45826 [ run ] completed with state

/bot run

PR_Github #45952 [ run ] triggered by Bot. Commit:

PR_Github #45952 [ run ] completed with state

@tensorrt-cicd Cannot see the exact failure of blossom-ci, can you share the details of the pipeline?
Force-pushed 511c041 to c9adab8.
CI flakiness, sorry about that. I'll retry.

/bot run

2 similar comments

/bot run

/bot run

PR_Github #46145 [ run ] triggered by Bot. Commit:

PR_Github #46145 [ run ] completed with state
Force-pushed c9adab8 to f4529ce.
/bot run

PR_Github #46170 [ run ] triggered by Bot. Commit:

PR_Github #46170 [ run ] completed with state

/bot run

PR_Github #46227 [ run ] triggered by Bot. Commit:

PR_Github #46227 [ run ] completed with state
PyExecutor reads TLLM_TORCH_PROFILE_TRACE directly and every rank calls torch_profiler.export_chrome_trace() on the same path. When TP/PP/DP > 1, the concurrent writes interleave and the resulting file fails to parse in Chrome tracing / Perfetto (bad control character / unterminated string at the byte where one rank's output overran another's).

Append the rank to the env-provided path before the first use so each rank writes to its own file. Matches SGLang's scheduler_profiler_mixin filename convention: the user supplies a base path, the runtime adds the per-rank suffix automatically. Example: TLLM_TORCH_PROFILE_TRACE=/tmp/trace.json now produces /tmp/trace-rank-0.json, /tmp/trace-rank-1.json, etc.

Signed-off-by: Gavin.Zhu <gavin.z@gmicloud.ai>
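A minimal sketch of the filename rewrite the commit describes; the helper name `_with_rank_suffix` is illustrative, not necessarily what the patch uses:

```python
import os

def _with_rank_suffix(path: str, rank: int) -> str:
    """Insert a -rank-N marker before the file extension."""
    base, ext = os.path.splitext(path)
    return f"{base}-rank-{rank}{ext}"

# TLLM_TORCH_PROFILE_TRACE=/tmp/trace.json on rank 3 -> /tmp/trace-rank-3.json
print(_with_rank_suffix("/tmp/trace.json", 3))
```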
Force-pushed f4529ce to 9c360e0.
/bot run --disable-fail-fast

PR_Github #46427 [ run ] triggered by Bot. Commit:

PR_Github #46427 [ run ] completed with state
Summary
PyExecutor reads `TLLM_TORCH_PROFILE_TRACE` directly and every rank calls `torch_profiler.export_chrome_trace()` on the same path. At TP/PP/DP > 1 the concurrent writes interleave and the resulting file fails to parse in Chrome tracing / Perfetto (bad control character / unterminated string at the byte where one rank's output overran another's). Fix: append the rank to the env-provided path before the first use so each rank writes to its own file.
`TLLM_TORCH_PROFILE_TRACE=/tmp/trace.json` now produces `/tmp/trace-rank-0.json`, `/tmp/trace-rank-1.json`, etc. — same convention SGLang's `scheduler_profiler_mixin` already uses (the user supplies a base path, the runtime adds the per-rank suffix automatically).
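A hedged sketch of where the rewrite sits relative to the export call. `torch.profiler.profile` and `export_chrome_trace` are the real PyTorch APIs; the env-var wiring and `RANK` lookup are illustrative of the flow, not copied from the patch:

```python
import os
import torch
from torch.profiler import ProfilerActivity, profile

trace_path = os.environ.get("TLLM_TORCH_PROFILE_TRACE")  # user-supplied base path
rank = int(os.environ.get("RANK", "0"))  # assumes the launcher exports a global rank

if trace_path:
    # Rewrite before first use so every rank targets its own file.
    base, ext = os.path.splitext(trace_path)
    trace_path = f"{base}-rank-{rank}{ext}"

# CPU-only here so the sketch runs anywhere; a real run would add
# ProfilerActivity.CUDA to capture GPU kernels.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch.randn(512, 512) @ torch.randn(512, 512)  # stand-in workload

if trace_path:
    prof.export_chrome_trace(trace_path)  # each rank now writes a distinct file
```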
Why this matters
A previous attempt at this fix (#9022, from `clintg6:fix/multi-gpu-torch-profiling`) received only a coderabbitai bot review and was closed by the author 6 days later without human engagement. This PR re-opens it with a smaller diff (8 lines vs. a ~15+ line refactor) and fresh validation evidence.
Validation
Reproduced on TRT-LLM `1.3.0rc11` with TP=8 on 8×H200 serving `zai-org/GLM-5.1-FP8`.

Before patch (single shared path):
The corrupt byte range contains `"name": "void at::native::vectorized_elem`, truncated mid-string by another rank's `{` for the next event.
"name": "void at::native::vectorized_elemtruncated mid-string by another rank's{for the next event.After patch (per-rank paths, same env value, same workload):
Distinct sizes confirm no shared clobbering; rank-0 parses with 63,079 events.
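A sketch of the parse check behind that event count, assuming the exported file is standard Chrome trace JSON with a top-level `traceEvents` array (the path is the rank-0 example from above):

```python
import json

# json.load raises on the interleaved bytes a shared path produces;
# it succeeds on a per-rank file.
with open("/tmp/trace-rank-0.json") as f:
    trace = json.load(f)

print(len(trace["traceEvents"]), "events")  # 63,079 for rank 0 in the run above
```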
Backwards compatibility
`TLLM_TORCH_PROFILE_TRACE` env name unchanged. Output is now named `<base>-rank-0<ext>` instead of `<base>`, even on a single rank. This is the same compromise SGLang made and the only sane disambiguation once you scale to multiple ranks.
Test plan
- Each per-rank trace parses cleanly with `json.load`.
- Change is confined to the trace-path rewrite at `py_executor.py:886`.

Reviewers
cc @NVIDIA/trt-llm-torch-runtime-devs @byshiue @xxi-nv — re-opening the multi-rank torch-profiler trace fix from #9022 (which went stale without human review) with a smaller diff and a concrete reproducer. Would appreciate eyes here so distributed profiling stops silently corrupting traces.